Genotype imputation accuracy and the quality metrics of the minor ancestry in multi-ancestry reference panels

Abstract Large-scale imputation reference panels are currently available and have contributed to efficient genome-wide association studies through genotype imputation. However, whether large-size multi-ancestry or small-size population-specific reference panels are the optimal choices for under-represented populations continues to be debated. We imputed genotypes of East Asian (180k Japanese) subjects using the Trans-Omics for Precision Medicine reference panel and found that the standard imputation quality metric (Rsq) overestimated dosage r2 (squared correlation between imputed dosage and true genotype) particularly in marginal-quality bins. Variance component analysis of Rsq revealed that the increased imputed-genotype certainty (dosages closer to 0, 1 or 2) caused upward bias, indicating some systemic bias in the imputation. Through systematic simulations using different template switching rates (θ value) in the hidden Markov model, we revealed that the lower θ value increased the imputed-genotype certainty and Rsq; however, dosage r2 was insensitive to the θ value, thereby causing a deviation. In simulated reference panels with different sizes and ancestral diversities, the θ value estimates from Minimac decreased with the size of a single ancestry and increased with the ancestral diversity. Thus, Rsq could be deviated from dosage r2 for a subpopulation in the multi-ancestry panel, and the deviation represents different imputed-dosage distributions. Finally, despite the impact of the θ value, distant ancestries in the reference panel contributed only a few additional variants passing a predefined Rsq threshold. We conclude that the θ value substantially impacts the imputed dosage and the imputation quality metric value.


INTRODUCTION
Genotype imputation is a cost-efficient technique to expand the number of markers in genetic studies.Li and Stephen's hidden Markov model (HMM) and the pre-phasing-imputation pipeline are employed by the popular tools Minimac, IMPUTE and BEAGLE, with slightly different implementations but similar accuracy [1][2][3][4][5].
The imputation accuracy (e.g. the squared correlation between imputed dosage and true genotype; dosage r 2 ) [6], the scores generated by software without the true genotype (Rsq and INFO) [3,7,8] and the number of variants passing a predetermined Rsq threshold (high-Rsq variants) are widely used quality metrics to evaluate imputation performance [9].This performance primarily depends on the reference panel size and genetic similarity between the reference panel and target sample [10].In studies using the International HapMap Project (HapMap) and the 1000 Genomes Project (1KGP) [11,12], pooling multiple ancestries together to maximize the panel size typically improves the imputation performance compared to using only the matched ancestry [8,13,14].Deelen et al. combined the population-specific whole genome sequencing (WGS) dataset of the Genome of the Netherlands (GoNL) with the 1KGP and showed a higher dosage r 2 than when using either GoNL or 1KGP alone [15].Huang et al. combined the UK10K with the 1KGP to generate more variants with high INFO [9].Moreover, larger multi-ancestry reference panels, such as the Haplotype Reference Consortium (HRC) and the Trans-Omics for Precision Medicine (TOPMed) [16,17], greatly improved the imputation accuracy and increased the number of high-Rsq variants in European (EUR), African (AFR) and Admixed American (AMR) [10,18].
However, constructing large reference panels by combining diverse ancestries is not always beneficial.Small-size populationspecific panels, such as Norwegian and Estonian panels [19,20], have achieved similar performance to the HRC panel in EUR.Moreover, although the HRC panel includes all 1KGP samples, the imputation accuracy in non-EUR populations can be inferior to that of the 1KGP panel alone [10,21].Bai et al. found that adding 27 samples from a different ancestry to the Han Chinese (1KGP-CHB, size = 103) resulted in better imputation accuracy than using the 1KGP-CHB or 1KGP panel (size = 2504) [22].In studies of Asian populations, combining population-specific WGS datasets with the 1KGP improved the imputation accuracy in some instances [23][24][25], but not in others [26][27][28].So far, the reason for these discrepancies is unclear.
Two sets of parameters are used in the HMM-based imputation algorithms.Transition probability is governed by the template switching rate (θ) between adjacent markers, which models recombination and relatedness; emission probability is determined by the error rate (ε) for each marker, which models genotyping error, gene conversion and recurrent mutation [3,7].These parameters are crucial to the HMM for calculating matching probabilities between the reference and target haplotypes during the imputation process, but dosage r 2 is robust to these parameters [3,5].Many studies have suggested different Rsq or INFO thresholds to achieve a similar dosage r 2 [29,30], making it difficult to ascertain the relationship between the imputation process, Rsq and dosage r 2 .
In this study, we imputed the East Asian (EAS) using the TOPMed reference panel (EAS comprised 1.22% of the total samples) and found that Rsq overestimated dosage r 2 , particularly in marginal-quality bins.We introduced novel variance component analysis for Rsq and analytically investigated why Rsq was overestimated and characterized the relationship between the template switching rate used in the HMM, quality metrics and imputed dosage.Furthermore, we evaluated the θ value, Rsq, dosage r 2 , the deviation between them and the number of high-Rsq variants from matched or distant ancestry in cases where the target ancestry was the major or minor component of the multi-ancestry panels.

Subjects, genotyping and quality control
Cohort specifications of BioBank Japan (BBJ-180k), genotyping and quality control (QC) are provided in Supplementary Method 1.For imputation using the TOPMed panel, we lifted the coordinates of the genotyping array from hg19 to hg38 using LiftOver [31].Palindromic variants (reference/alternative (ref/alt) alleles were G/C or A/T) and variants not in the TOPMed freeze5b (https:// bravo.sph.umich.edu/freeze5/hg38/)were excluded.The ref/alt alleles were swapped and/or reverse-complemented according to the hg38 reference sequence.A total of 515 587 autosomal variants remained.

Pre-phasing, reference panel construction and imputation
EAGLE v2.4.1 was used for pre-phasing without an external reference [32].The 1KGP (p3v5) reference panel was downloaded from https://genome.sph.umich.edu/wiki/Minimac3.Methods used to construct the population-specific panels (named the BBJ1k and JEWEL3k) are provided in Supplementary Method 2. We used an in-house server to perform imputation using the 1KGP, BBJ1k and JEWEL3k panels.Minimac3 v2.0.3 was used to estimate the HMM parameters and prepare m3vcf files [3].Minimac4 v1.0.2 was used for imputation [33].Imputation using the TOPMed and HRC panel were performed using the TOPMed and Michigan imputation servers [3,17], respectively.

Evaluation of the imputation performance using different reference panels
Rsq from the Minimac4 info file was extracted for the subsequent analyses.Variants with Rsq <0.3 were removed [34].Imputation performance was empirically evaluated using 993 samples with WGS (named WGS 993 ).QC and processing are described in Supplementary Method 3. Coverage of a reference panel was the fraction of variants in WGS 993 that could be imputed with Rsq ≥ 0.3.Dosage r 2 was the squared Pearson correlation coefficient between the imputed dosage and true genotype (encoded as 0, 1 or 2).We grouped the variants into dosage r 2 bins with a 0.05 interval and bootstrapped each bin 1000 times to obtain the 2.5% and 97.5% quantiles (95% confidence interval; CI) of Rsq − dosage r 2 .Minor allele frequency (MAF) and minor allele count (MAC) of WGS 993 were used to stratify variants into bins.

Quantification of the deviation between Rsq and dosage r 2
We modeled the relationship between the imputed allelic dosage (y = y 1 , . . ., y n , y i ∈ [0, 1]) and the true allele (x = (x 1 , . . ., x n ), x i ∈ {0, 1}) using the simple linear regression formula: where β imp is a scalar of the regression coefficient with βimp = Cov x, y /Var (x), β 0 is the intercept and ε = ( 1 , . . ., n ) is an error term.The Minimac EmpRsq metric, which is the squared Pearson correlation coefficient between x and y (https://genome.sph.umich.edu/wiki/Minimac3_Info_File),equals the ratio between the regression sum of squares (SS reg ) and the total sum of squares (SS tot ) in this simple linear regression.Consequently, we have where Cov(x, y), Var(x) and Var(y) are the covariance between x and y, the variance of x and the variance of y, respectively.Rsq is the ratio between Var(y) and p 1 − p , where p is the alternative allele frequency (AAF) in the imputed dataset [3,7].Then, where SS res is the residual sum of squares that follows SS tot = SS reg + SS res , and n is the number of imputed haplotypes.Hence, Rsq comprises two parts: regression related and residual related.We define SSres/n p(1−p) as MAF-Adjusted-Residual-Error (MARE).Then: In Equation (7), βimp is negative if x and y are negatively correlated.We did not consider that situation.Hence, each combination of Rsq and EmpRsq indicates specific values of MARE and βimp .Hereafter, we refer to βimp as the β imp metric and MARE as the MARE metric.
Because the WGS dataset comprises unphased diploid data, dosage r 2 calculated from that will be slightly different from EmpRsq.Supplementary Note 1 provides the detailed methods to calculate MARE and β imp from haploid or diploid data, and the concordance between values obtained using Equations ( 6)-( 7) and calculated from the imputed dosage in both haploid and diploid cases.

Imputation using the simulated 1KGP reference panels with different θ values
The 1KGP (p3v5) vcf files were downloaded from the Minimac3 website (see above).Parameters were estimated using Minimac3.The θ value was extracted (denoted as 'Recom' in the m3vcf file) [3], scaled by 21 folds manually (0.01-100) and replaced in the original file.A brief explanation of the θ value is provided in Supplementary Note 2. The array data of WGS 993 , a subset of BBJ-180k, was used as the target sample.The imputation procedure was the same as described above.

Evaluation of the imputation performance using the simulated reference panels
We obtained the imputed allelic dosage (LooDosage) from the Minimac4 empiricalDose file (by turning the '-meta' option on) [35].It was derived from the leave-one-out method by hiding markers on the array during the imputation.Rsq, EmpRsq, MARE and β imp were calculated from the LooDosage and array data.We bootstrapped the metric values 1000 times to obtain the 95% CI.MAF was obtained from the array data.

Simulation of reference panels and estimation of the θ value
We sampled subsets from the 1KGP and JEWEL3k, shuff led the sample order 10 times and created new vcf files using bcftools v1.14 (https://samtools.github.io/bcftools/)[36].We estimated the parameters using Minimac3.Supplementary Method 4 provides the detailed methods.The total θ value along chr19 (by summing the θ values between adjacent markers) was used to evaluate the reference panel size and ancestral diversity impact.A discussion of the θ value qualification is provided in Supplementary Note 3.

JPT-1KGP reference panel simulation and imputation
We sampled seven subsets (size = 100, 500, 1000, 1500, 2000, 2500 and 3256) from the JPT 3256 and combined the JPT 3256 with six subsets (1KGP-JPT and one to five ancestries) of the 1KGP.WGS 993 was used as the target.The other processing methods were the same as above.

Number of confident alleles and high-Rsq variants
Haploid dosage (HDS) is the imputed alternative allele's allelic dosage at the haploid level, which could be obtained using the 'format HDS' option in Minimac4.Brief ly, HDS is obtained in a pervariant and per-individual manner, and a higher value indicates that an imputed alternative allele is more certain.We quantified the θ value's impact on the imputed dosages by the number of confident alleles (HDS > 0.9).Rsq was used to judge how many high-Rsq (Rsq > 0.7) variants could be passed to the downstream analyses.
Furthermore, we determined how many confident alleles and high-Rsq variants could be obtained only from the additional distant ancestries in the multi-ancestry reference panel.We defined ancestry-specific variants as follows: EUR-only variants only existed in the 1KGP-EUR 403 , and it has no non-EUR variants.JPT 3256 -only and 1KGP-EAS-only variants only existed in the JPT 3256 and 1KGP-EAS, respectively.Non-EAS variants were not found in the JPT 3256 or 1KGP-EAS.

Replication of the TOPMed imputation pipeline
We followed the TOPMed imputation pipeline (accessed on 15 October 2022, https://topmedimpute.readthedocs.io/)and used the 1KGP and JEWEL3k reference panels.Specifically, the HapMap2 genetic map was used as a reference for the θ value instead of estimating it using Minimac3.WGS 993 was used as the target.As the θ value was not output by default, we modified the Minimac4 source code to obtain the transformed θ value (Supplementary Method 5).

BBJ imputation using the four reference panels
The BBJ-180k was imputed using the TOPMed, 1KGP, BBJ1k and JEWEL3k reference panels (Figure 1A).Characteristics of each panel and the target sample are listed in Table 1.We categorized the imputed variants by MAF and Rsq in each imputed dataset.(C) Simulations to evaluate the relationship between the reference panel, HMM parameter, imputation metrics and imputation result.Two scenarios were simulated to represent the imputations of the major and minor ancestry in a multi-ancestry panel.The 3256 JPT WGS and 1KGP were used to simulate reference panels with different sizes and ancestral diversities.European (EUR) and JPT samples were used as the target.Imputation results were evaluated the same as in (B).In addition, we evaluated how the Minimac3's parameter estimates change with the panel composition.
With more EAS samples, more low-frequency variants (MAF < 5%) passed each Rsq threshold (Figure 2A and Supplementary Table 1).In addition to the absolute number, unique variants were imputed from each panel (Figure 2B).These results reproduced the benefits of using large and different reference panels [19].
The 155 297 shared SNVs (on chr19) imputed by the four panels were used to compare imputation accuracy.The mean dosage r 2 using the TOPMed panel was between those of 1KGP and BBJ1k (Figure 2D).However, Rsq showed 1.83-26.2times higher upward biases when dosage r 2 was <0.85 in the TOPMed imputation result (Figure 2E and Supplementary Table 2).Using all SNVs and short insertions and deletions (indels) with Rsq ≥ 0.3, although with higher coverage, the TOPMed imputation result showed an inferior dosage r 2 than 1KGP (Supplementary Figure 1).

Quantifying the deviation between Rsq and dosage r 2
The deviation between Rsq and dosage r 2 was persistent in all MAF bins only when using the TOPMed panel (Supplementary Figure 2), which indicated a potential systematic bias from the imputation pipeline or the reference panel.To validate this observation, we analytically derived the relationship between Rsq and dosage r 2 (Methods).Two novel metrics, MARE and β imp , were introduced to quantify the deviation.MARE, an MAF-adjusted form of residual error, takes a value between 0 and 1, and increases with SS res .β imp describes the distinguishability between the mean imputed dosage of each true genotype group.Rsq is the ratio between observed and expected variance and dosage r 2 shows the correlation.Under the assumption of 'well-calibration' (the posterior allele probability from imputation equals the expected true allele dose), Rsq equals dosage r 2 [10].Our analytical derivation  of the relationship between Rsq and dosage r 2 did not assume 'well-calibration' (Supplementary Figures 3-7 and Supplementary Note 1).The relationship between the imputed dosage and the true genotype determined the four metrics.Any two of Rsq, dosage r 2 , MARE and β imp could entirely quantify this relationship and determine the other two metrics (Equations ( 6)-( 7)), whereas Rsq or dosage r 2 alone could not.
To exhibit the relationships, we plotted the theoretical values of MARE and β imp on the coordinates of Rsq and dosage r 2 (Figure 3A and F) to show that the overestimated Rsq was accompanied by a higher MARE (Figure 3A).We used rs142572000 as an example (Figure 3B-E).In the TOPMed imputation result, imputed genotypes were more certain (defined as the imputed dosage closer to 0, 1 or 2) (Figure 3B) [37] compared to the other three panels (Figure 3C-E).The high certainty increased Var(y) and Rsq.In Supplementary Note 4, the positive relationship between imputed-genotype certainty and Rsq is demonstrated.As discussed below, a high certainty or Rsq did not mean that the imputation is more accurate.As shown in Figure 3B, many heterozygotes were incorrectly imputed with a dosage of approximately 0 in the TOPMed imputation, causing higher MARE and Rsq and an even lower dosage r 2 .
Variants with Rsq < dosage r 2 showed a lower β imp (Figure 3F).We used rs671 as another example (Figure 3G-J).Except for the TOPMed imputation result, imputed dosages were shrunk to the AAF in EAS (0.247) (Figure 3H-J), causing uncertainty in the imputed genotype, and decreased β imp , MARE and Rsq.However, imputed dosages were still highly correlated with the true genotypes (Figure 3F), causing dosage r 2 > Rsq.Taking the two examples together, imputation results of a similar Rsq or dosage r 2 may have a great difference in the imputed dosage; thus, more comprehensive evaluations are necessary and the deviation between Rsq and dosage r 2 should not be ignored.Supplementary Note 5 describes the selection criteria for rs142572000 and rs671.To evaluate the generalizability, we presented the three other examples (rs1047781, rs113230003 and rs7624610) reported by genome-wide association studies and compared Rsq − dosage r 2 to the imputed-genotype certainty and showed that in dosage r 2 bins 0.6-1, the imputed-genotype certainty increased as the deviation increased (Supplementary Note 5, Supplementary Tables 3-4 and Supplementary Figure 8).
We categorized MARE into Rsq bins and β imp into dosage r 2 bins to compare them between reference panels and to the expected values when assuming Rsq equals dosage r 2 .The mean MARE of the TOPMed result was above the expected value for Rsq bins 0.35-0.9,and the mean β imp of the 1KGP result was below those of the other panels (Supplementary Figure 9).As demonstrated above and in Supplementary Notes 4-5, these metrics could reveal the imputed dosage distribution.Thus, these findings indicated that the TOPMed result was more certain at particular dosage r 2 bins and might contain more wrongly imputed genotypes compared to that expected from a deviation-free imputation result (Figure 3B and G).

Template switching rate impacts the deviation between Rsq and dosage r 2
The high certainty suggests an overconfident matching between the reference panel and target sample.We investigated how the template switching rate (θ) used in the HMM affects the imputedallele certainty using the 1KGP reference panel and 21 scalings of the θ value (0.01-100-fold of that estimated by Minimac3; Methods; Figure 1B).Low θ values caused a deviation toward Rsq > EmpRsq from the 45-degree line when comparing the variants on chr19 (EmpRsq is the alternate of dosage r 2 ; Methods) (Supplementary Figure 10A-C), while high θ values caused Rsq < EmpRsq (Supplementary Figure 10E-G).Using rs10410162 as an example (Supplementary Note 6 describes the selection criteria), Rsq and MARE increased as the θ value decreased (Figure 4A), with increasing the certainty of the imputed alleles (Supplementary Figure 11A-C) and deviation toward Rsq > EmpRsq (Figure 4B).As the θ value increased, the imputed allelic dosages were shrunk to the AAF (0.351) (Supplementary Figure 11E-G), resulting in a more drastic decrease in Rsq than EmpRsq (Figure 4A) and thereby causing EmpRsq > Rsq.Notably, EmpRsq and β imp were roughly maintained unless the θ value was scaled up by 2-fold or higher (Figure 4A and C), suggesting EmpRsq was insensitive to altering the θ value, particularly the downscaling.In contrast, Rsq was sensitive to the θ value and consistently increased with downscaling of the θ value, leading to the deviation between Rsq and EmpRsq.Such observations were verified using the same variants (rs1047781, rs113230003 and rs7624610) mentioned above (Supplementary Note 6, Supplementary Table 5 and Supplementary Figure 12).Furthermore, the relationship between the deviation and the imputed-allele certainty was verified using all variants on chr19 (Supplementary Note 6 and Supplementary Table 6).
We also showed that low θ values increased the imputed-allele certainty using a randomly selected target haplotype (Figure 4D and Supplementary Figure 13).High certainty might cause some variants to be wrongly imputed as the opposite allele (Supplementary Figure 14).This is explained further in Supplementary Note 7.Such imputed-dosage properties were also revealed by the mean MARE and β imp across all variants on chr19 (Supplementary Figure 15).Taking the metric values and imputed dosages together, the simulated imputation using an extremely low θ value mimics the high certainty and Rsq overestimation in the TOPMed imputation result in EAS (Figure 3).

Template switching rate impacts the imputation performance
Using the modified 1KGP panels, although the mean EmpRsq was insensitive to the scaling of the θ value (maximum difference of 0.034 for the θ value scaling between 0.01-and 2-fold; Figure 5), low θ value increased Rsq (Figure 5).We evaluated the number of confident alleles (HDS > 0.9) and high-Rsq variants (Rsq > 0.7) (Methods).Downscaling the θ value increased the number of confident alleles and high-Rsq variants (Supplementary Table 7).When the θ value was 0.5-fold, the number of confident alleles and high-Rsq variants increased by 5.46 and 8.89%, respectively, and if the θ value was 2-fold, these numbers decreased by 11.8 and 13.8%, respectively.These results indicated that the θ value shaped the imputed allelic dosage and changed Rsq and the number of high-Rsq variants while leaving EmpRsq almost the same.

Reference panel and θ estimates
We evaluated how Minimac3's parameter estimation changed with the composition of the reference panel (Supplementary Method 4, Supplementary Figure 16, and Supplementary Note 3).The θ estimates decreased with the sample size of a single ancestry and increased with ancestral diversity when size was fixed (Figure 6A and B).On pooling samples from different ancestries together at a small panel size, the θ estimates still decreased as the panel size increased (Figure 6C); however, at a larger panel size, it increased with simultaneously increasing the panel size and ancestral diversity (Figure 6D), suggesting that the θ estimates were in a trade-off between the panel size and ancestral diversity (Figure 6E).Supplementary Note 8 explains these effects.

Fitness of the θ value, deviation and imputation performance
To elucidate the multi-ancestry reference panel's impact on the imputation result, we simulated two scenarios using the multiancestry reference panels: the target sample was from the (1) minor and (2) major ancestry (Figure 1C).

Scenario 1: the target sample was from the minor ancestry
We simulated 8 EUR-EAS reference panels and used 100 EUR samples as the target (Methods).As the panel size increased and θ value decreased (Table 2), Rsq and MARE were upwardly biased, as expected (Supplementary Figures 17 and 18).The mean EmpRsq decreased consistently with the addition of EAS samples to the reference panel (Table 2).However, only a marginal difference was observed (maximum difference of 0.020).
There were 114 606 EUR-only and 696 017 non-EUR variants (Methods).As the panel size increased and θ value decreased, the number of confident alleles increased from 84 797 to 101 116 and from 0 to 458 for EUR-only and non-EUR variants, respectively (Supplementary Table 8).The number of high-Rsq variants increased from 163 468 to 200 161, 18 646 to 27 941 and 0 to 521 for all, EUR-only and non-EUR variants, respectively (Supplementary Table 8).These results revealed that a lower θ value increased the number of confident alleles and high-Rsq variants, without improving the EmpRsq (Supplementary Table 8).Meanwhile, even if the reference panel comprised 90.3% EAS samples, non-EUR variants only comprised 0.26% of the total variants in the imputation result (when setting a cutoff of Rsq > 0.7).The majority of variants gained from the larger panel when imputing the underrepresented ancestry was because of the lower θ value.

Scenario 2: the target sample was from the major ancestry
We simulated 13 JPT-1KGP reference panels and used WGS 993 as the target (Methods).When only JPT samples were in the panel, the mean EmpRsq and Rsq increased with the panel size, as expected (Supplementary Figure 19).When combined with the 1KGP subsets, the mean EmpRsq was highest when using JPT 3256 + 1KGP-EAS for variants with MAF ≥ 1% and JPT 3256 + 1KGP-JPT for variants with 1% > MAF ≥ 0.5% (Table 3).Adding other ancestries decreased the mean EmpRsq marginally (maximum difference < 0.01 for all MAF categories).The mean Rsq was the highest when using JPT 3256 and decreased with the addition of more ancestries, with maximum differences of 0.007, 0.024 and 0.048 for variants with MAF ≥ 5%, 5% > MAF ≥ 1%, and 1% > MAF ≥ 0.5%, respectively (Table 3).None of the imputation results showed a noticeable deviation in the MARE and β imp .The 3256 WGS samples of BBJ are included in all panels.1KGP-followed by the ancestry represents the 1KGP subset.Minor allele frequency (MAF) is determined by the genotyping array.
However, the MARE and β imp of the JPT 3256 , JPT 3256 + 1KGP-JPT and JPT 3256 + 1KGP-EAS results were closer to the expected values (Supplementary Figure 20), while adding other ancestries made MARE and β imp correspond to the condition of using a higher θ value.There were 74 490 JPT 3256 -only, 7627 1KGP-EAS-only and 495 489 non-EAS variants (Methods).From JPT 3256 to JPT 3256 + 1KGP, the number of confident alleles decreased from 24 599 to 23 666, increased from 0 to 131 and increased from 0 to 254 for these three groups of variants, respectively (Supplementary Table 9).The number of high-Rsq variants decreased from 274 343 to 264 221 (a decrease of 3.69%) when using JPT 3256 or JPT 3256 + 1KGP (Supplementary Table 9).Only 265 non-EAS variants reached an Rsq > 0.7 when using JPT 3256 + 1KGP, which was 0.10% of the total variants.Hence, combining with the 1KGP caused fewer confident alleles and high-Rsq variants to remain in the imputation results, as expected from using higher θ values.Both scenarios indicated that the distant ancestry in the reference panel affected the θ estimates and the number of high-Rsq variants.However, the expanded haplotypes and variant sets from distant ancestries provided only a few additional variants.

EAS imputation using the public reference panels
We used the TOPMed imputation pipeline (Methods) and observed an upward deviation of Rsq, particularly when using the 1KGP panel (Supplementary Figure 21).The θ value transformed from the genetic map was 0.15-and 0.27-fold of that estimated by Minimac3 when using the 1KGP and JEWEL3k panels, respectively (Supplementary Table 10).
As the θ estimates decreased with the size of the major ancestry (Figure 6C), it would be underestimated for the minor ancestry in the multi-ancestry panel regardless of using the genetic map or estimating it by the Mimimac3.One example was the HRC panel, which did not use the genetic map but induced Rsq overestimation in the BBJ-180k (Supplementary Figure 22).Thus, caution may be required when imputing the under-represented population using large ancestry-imbalanced panels under the current framework.

DISCUSSION
The deviation between Rsq and dosage r 2 in the imputation result is raised from the imputed dosage, and has been widely observed using different reference panels and software [9,15,38].One reason for this observation was that the θ value used in the HMM does not correspond among the panel size, ancestral components in the reference panel and target population.When using the multiancestry reference panel, distant ancestries affect the θ estimates (in Minimac3/4), followed by the imputed dosage, dosage r 2 , Rsq and deviation.Moreover, the addition of distant ancestries only contributes a few high-Rsq variants, suggesting that these reference haplotypes have been assigned a low probability in the HMM.The subsetting of closely related samples from the reference panel has been adopted by the IMPUTE software; however, IMPUTE has been reported to produce an INFO score deviating from dosage r 2 , indicating that the construction of single-ancestry reference panel is still the optimal choice under the current HMM framework.
Our simulations indicate that the lower θ value used by a large multi-ancestry panel could increase the imputed-genotype certainty but not dosage r 2 .As Rsq is a measurement of certainty and is not related to the true genotype, using a multiancestry panel may lead to confusing benchmarking results and increase the chance of false positives in association tests.Ferwerda et al. reported height and body mass index association signals in an ethnically diverse cohort only when imputing against the TOPMed panel [39].Bai et al. reported the highest Rsq but lowest dosage r 2 when imputing the Han Chinese using the HRC panel, compared to the 1KGP and a population-specific panel [22], similar to our observations in the Japanese population.Dosage r 2 determines the association test power compared to using the true genotype [10].When using Rsq as an indicator of dosage r 2 , which indicates the power, the deviation may affect the optimal choices of downstream analyses [40].Beside dosage r 2 and Rsq, imputation has been reported to cause variability in genetic score calculations [41], which may further affect downstream analyses that require genotype aggregation, like polygenic score estimation and transcriptome-wide association studies [42].Such implications suggest that a thorough examination of the imputation result is warranted when using the multi-ancestry reference panel.In addition to simply comparing dosage r 2 or Rsq, we have detailed the changes in these metrics, θ estimates, imputeddosage certainty and the number of high-Rsq variants passing to downstream analyses.Our findings were further validated and confirmed by replicating all analyses using SNPs on chromosome 20 (Supplementary Note 9 and Supplementary Tables 11-20).Furthermore, we have provided a script (https://github.com/shimaomao26/impumetric) to allow users to check their results using the leave-one-out imputation of Minimac4 (external WGS not required) or additional WGS data.
The TOPMed imputation pipeline used the θ value transformed from the HapMap2 genetic map by assuming 0.01 centimorgan (cM) corresponds to a 1% switching rate.The switching rate would thereby be fixed given the variant's base pair position and centimorgan.Two reasons may explain why this method works well.First, the genetic map did not significantly change the imputation accuracy, as verified in a Finnish study [43].We also showed that dosage r 2 was insensitive to the θ value.Second, the θ value was only underestimated for the EAS (size = 1184), but the sizes of EUR, AFR and AMR (size = 17 085-47 159) might fit this value.Previous studies have reported that Rsq in the TOPMed imputation result was sometimes misleadingly high in AFR [38] and EUR [44].A recent paper reported that the TOPMed panel was more robust to the low-density genotyping array than the HRC and 1KGP panels [18].These results also indicated that a fixed low θ value might be used by the TOPMed pipeline, as we have inferred.Other imputation tools, such as IMPUTE and BEAGLE, use fixed recombination rates predetermined by the HapMap2 genetic map.Although we did not explicitly evaluate different tools in this study, the INFO score has overestimated dosage r 2 in multiple studies [9,29,30], indicating that the relationship between this parameter and the imputation result is not software or reference panel specific.Further studies are warranted to comprehensively elucidate the impacts on genetic studies.
Our results suggest that avoiding ancestral diversity is best when more than 3000 WGS samples are available to construct a JPT reference panel.Increased diversity would then only have a marginal impact on dosage r 2 but would affect the individual's imputed dosage and cause fewer variants to pass a predetermined Rsq filter.Zhang et al. and Cong et al. similarly observed that adding the 1KGP to about 3000 Chinese WGS samples would neither benefit nor harm dosage r 2 [26,27].We focused on explaining the reason underlying this observation in this work.Further studies are warranted to study the algorithm implementation and parameter estimation and to take advantage of combining population-specific and public WGS datasets while avoiding our identified problems.
Our study has several limitations.First, our simulations of the reference panel were study specific.As discussed, the θ estimates decreased with the panel size and increased with the ancestral diversity.However, a larger panel typically increases the diversity simultaneously.Therefore, the θ value and imputation result need to be discussed case by case.This limitation also implies that the general experience may not work well for a new reference panel and target dataset.Second, in simulations of multi-ancestry reference panels (scenarios 1 and 2), WGS datasets were merged using IMPUTE2 [9].We did not investigate the impact of IMPUTE2 but treated the merged dataset similar to the WGS dataset, possibly, underestimating the number of high-Rsq variants from the distant ancestry (Supplementary Note 10).Our results using the 1KGP (not modified by IMPUTE2) also revealed that only a limited number of confident alleles and high-Rsq variants in the imputation result were from the distant ancestry (Supplementary Note 10 and Supplementary Table 7).Therefore, the conclusion that distant ancestry in the reference panel affects the θ estimates rather than providing high-Rsq variants would be valid.However, further studies should be conducted to determine the impact of the panel-merging method and the net gain of variants from distant ancestries.Third, we did not consider the phasing error in the reference panel and target sample.Further studies are warranted to evaluate the HMM parameter's impact on phasing accuracy and the different combinations of pre-phasing and imputation method.
In summary, we explained that the HMM parameter could be a potential reason for inaccurate Rsq and inferior dosage r 2 when using large multi-ancestry reference panels.This is also the first study of the relationship between the template switching rate, imputed-genotype certainty, Rsq and dosage r 2 .We envision that our methods and conclusions could provide insights for benchmarking studies, construction of reference panels, and development of imputation algorithms and pipelines in the future.

CONCLUSION
The relationship between the reference panel, the template switching rate (θ value), the imputed dosage, and the deviation between Rsq and dosage r 2 are summarized in Table 4.When a multi-ancestry reference panel is used, dosage r 2 is insensitive to a range of θ values, while Rsq increases with a lower θ value.This could create a deviation between Rsq and dosage r 2 .For under-represented populations in large multi-ancestry reference panels, the majority gain of additional variants with high Rsq is not from the additional reference haplotypes but is rather because of the low θ value used and Rsq overestimation.On the other hand, for the major ancestry in the reference panel, the high θ value causes fewer variants to remain in the Rsq-filtered dataset.Our findings suggest utilizing only the matched single-ancestry reference panel, and avoiding benchmarking using only Rsq or dosage r 2 .

Key Points
• The template switching rate (θ value) used in the hidden Markov model changes the imputed-genotype certainty and may cause bias in the standard quality metric (Rsq).• The θ value estimation depends on the reference haplotype's similarity.It decreases with the size of a single ancestry and increases with ancestral diversity.• A low θ value will be generated and used for the underrepresented population in a multi-ancestry reference panel.• In the imputation of the under-represented population using the large multi-ancestry reference panels, the majority gain of additional variants with high Rsq is caused by the low θ value used.

Figure 1 .
Figure 1.Overview of this study.(A) Imputation using the TOPMed, 1KGP, BBJ1k (1037 Japanese (JPT) WGS + 1KGP) and JEWEL3k (3256 JPT WGS + 1KGP) reference panels.The 180k EAS samples in the BBJ first cohort were imputed against the four panels.Samples with whole genome sequencing (WGS 993 ) were used to empirically evaluate the imputation performance, including a count of imputed variants and estimated (Rsq) and true (dosage r 2 ) imputation accuracy.The deviation between Rsq and dosage r 2 was examined and quantified analytically.Two novel metrics, MARE and β imp , were introduced to indicate the different imputed-dosage distributions underlying the deviation.(B) Simulations to evaluate the relationship between the HMM parameter, Rsq, dosage r 2 , the deviation between Rsq and dosage r 2 and imputed dosage.The reference panel and target sample were fixed to the 1KGP and WGS 993 , and 21 scaled template switching rates (the θ value) were used for running imputations.Evaluations were the same as in (A).(C)Simulations to evaluate the relationship between the reference panel, HMM parameter, imputation metrics and imputation result.Two scenarios were simulated to represent the imputations of the major and minor ancestry in a multi-ancestry panel.The 3256 JPT WGS and 1KGP were used to simulate reference panels with different sizes and ancestral diversities.European (EUR) and JPT samples were used as the target.Imputation results were evaluated the same as in (B).In addition, we evaluated how the Minimac3's parameter estimates change with the panel composition.

Figure 2 .
Figure 2. Imputation results using the TOPMed, 1KGP, BBJ1k and JEWEL3k reference panels.(A) Count of imputed variants using the four panels, stratified by minor allele frequency (MAF) and Rsq in each imputation result.(B) Count of unique and shared variants (with Rsq ≥ 0.3) using the TOPMed, 1KGP and JEWEL3k panels.(C) Fraction of SNV and indel in WGS 993 that could be imputed with Rsq ≥ 0.3 (coverage) using each panel, stratified by MAF of WGS 993 .(D) The mean dosage r 2 and Rsq, stratified by MAF of WGS 993 .(E) Comparison of dosage r 2 and Rsq in the imputation results using the four panels.The scatter represents the variants, with the density shown by the contour lines.The diagonal line shows that Rsq equals dosage r 2 , and the histograms on the side show the distribution.In (C), (D) and (E), variants on chromosome 19 were used.In (D) and (E), overlapping variants imputed by the four panels (Rsq ≥ 0.3 using all panels) were used.

Figure 3 .
Figure 3. MARE, β imp , Rsq, dosage r 2 and imputed dosage of rs142572000 and rs671.(A) The plot shows Rsq and dosage r 2 of rs142572000, and the theoretical values of MARE.(B-E) The plot shows the imputed dosage of rs142572000 in the imputation result using the (B) TOPMed, (C) 1KGP, (D) BBJ1k and (E) JEWEL3k reference panels.(F) The plot shows Rsq and dosage r 2 of rs671, and the theoretical values of β imp .(G-J) The plot shows the imputed dosage of rs671 in the imputation result using the (G) TOPMed, (H) 1KGP, (I) BBJ1k and (J) JEWEL3k panels.In (A) and (F), the diagonal line shows that Rsq equals dosage r 2 , and the color keys show the theoretical MARE or β imp values.The dashed lines represent a value difference of 0.1 of MARE or β imp .The arrow indicates similar Rsq but different dosage r 2 (rs142572000), or similar dosage r 2 but different Rsq (rs671) between TOPMed and JEWEL3k imputation result.In (B-E) and (G-J), the strip plot shows the individual and the violin plot shows the distribution.The regression line is between the imputed dosage and true genotype.MARE and β imp values are denoted below each plot.

Figure 4 .
Figure 4. Changes in the metrics value and the imputed allelic dosage with different scalings of the θ value.(A) Rs10410162 is used as an example to visualize the changes in Rsq, EmpRsq, MARE and β imp (y-axis) with the 21 scalings of the θ value (x-axis).To have a better illustration, the x-axis is in the minus log scale.The scalings of 100-, 10-, 2-, 1-, 0.5-, 0.1-and 0.01-fold are circled, marked with vertical dashed lines, and labeled.The other scaling folds are shown as the scatters, with the color key denoting the fold in a log scale.(B-C) Rs10410162 is used as an example to visualize the deviation between Rsq, EmpRsq and MARE (B) or β imp (C).The diagonal line shows that Rsq equals dosage r 2 , and the color keys show the theoretical MARE or β imp values.The dashed lines represent a value difference of 0.05 of MARE or β imp .The 21 scatters are the same as (A).(D) The plot shows the imputed allelic dosage of a randomly selected target haplotype (chr19) with the seven scalings of the θ value.The scatters show variants.The boxes show the median, upper (75%) and lower (25%) quartiles.The whiskers show the 1.5-fold of the interquartile range (IQR) extended from the upper or lower quartile if a value exceeds them; otherwise, they show the maximum or minimum value.Only variants with imputed allelic dosage <0.5 and 0.3 < EmpRsq < 0.8 when using the 1-fold θ value are shown.

Figure 5 .
Figure 5. Rsq and EmpRsq using the 1KGP panel with different scalings of the θ value.The plot shows the mean Rsq and EmpRsq of each imputation result, stratified by the MAF of the array.The scalings of 100-, 10-, 2-, 1-, 0.5-, 0.1-and 0.01-fold are marked with vertical dashed lines and labeled.

Figure 6 .
Figure 6.Total θ value of the simulated reference panels.The figure shows the total θ value of simulated reference panels, estimated by Minimac3.(A) JPT-only panels with different sample sizes.(B) Fixed-size panels (504) with one to five levels of ancestral diversity, as shown in the legend.(C) Adding large-size (504-3760) EAS samples to a small-size (403) EUR panel.(D) Adding 1KGP samples to a large-size (3256) JPT panel.(E) Simultaneously increasing the size and ancestral diversity using one to five ancestries in the 1KGP.In each plot, the scatters show the values of 10 runs.The boxes show the median, upper (75%) and lower (25%) quartiles.The whiskers show the 1.5-fold IQR extended from the upper or lower quartile if a value exceeds them; otherwise, they show the maximum or minimum value.In (C), EURn403 represents the 403 EUR.In (C) and (D), 500-3256JPT represents the number of JPT samples in the panel.

Table 1 :
Characteristics of the reference panels and target samples used in this study, and the number of variants imputed

Table 2 :
Reference panel description, the total θ value estimated by Minimac3, the mean EmpRsq and the mean Rsq using the simulated EUR-EAS reference panels with different sizes and EUR proportions, and 100 EUR as the target sample EURn403 represents the 403 EUR; 1KGP-EAS represents the 504 EAS; 500-3256JPT represents the number of JPT samples in the reference panel.Minor allele frequency (MAF) is determined by the genotyping array.

Table 3 :
Reference panel description, the total θ value estimated by Minimac3, the mean EmpRsq and the mean Rsq using the simulated JPT-1KGP reference panels with different sizes and ancestral diversity, and WGS 993 as the target sample

Table 4 :
Impact on the imputation result when the target ancestry is the major or minor component in the multi-ancestry reference panel